Text File | 1994-10-24 | 27KB | 615 lines
GENSRCH
REVISION 2.0.1
This document is broken up into several sections:
INTRODUCTION
COPYRIGHT
GENSERV
DISADVANTAGES
PROGRAM DESCRIPTIONS
INSTALLATION
HOW TO USE WITH A LOCAL COLLECTION OF GEDCOM FILES
USING WITH DATA ON A CD-ROM
DEMOS
WHAT TO EXPECT IN REAL LIFE
WHAT'S A GEDCOM FILE
WHAT'S A SOUNDEX
THE AUTHOR
INTRODUCTION
A set of tools for genealogical research using gedcom files. Lets
you search for common ancestors between different gedcom files. If
you don't know what a gedcom file is, look at the end under "What's
a gedcom file".
I guess the best way to explain it is with a simplified example. I'll
leave out some of the setup steps, which are explained later, just to
give you the concept. By the way, this example is a true story.
Let's say you belong to a genealogy society (club) and have a
collection of gedcom files from many of the people in the club. You
want to find out if any of the club members have common ancestors
with you, or with each other. After some initial setup, which is
done whenever you add a new gedcom file to your collection, you issue
the command:
gensrch Your_database_name *.ndx
Or to send the results to a file instead of your screen:
gensrch Your_database_name *.ndx > results
Your_database_name is typically the name of your gedcom file when
you set up your total database of gedcom files.
The results look something like this:
Search for matches to database johns1
==============================================================================
LAST, First INDI# Spouse name SNDX Birthdate Deathdate Database
----------------- ------ ----------------- ---- ----------- ----------- -------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
Mathis, Frances 495 Coleman, R M320 johns1
----------------------------------------
MATHIS, Frances 1902 COLEMAN, R Sr. M320 20 Feb 1749 1809 coleman2
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
SINGLETARRY, LYDIA 349 LADD, DANIEL S524 johns1
----------------------------------------
SINGLETERY, Lydia 456 LADD, Daniel S524 30 Apr 1648 pricej1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
Thayer, Cicely 394 DAVIS, James T600 28 May 1673 johns1
----------------------------------------
THAYER, Cicely 255 T600 1595 28 May 1673 thayer
My gedcom file's name is johns1.ged, so my database name is also
johns1. The report says that in johns1.ged there is a person
(Mathis, Frances) who looks to be the same as one in coleman2.ged.
I also match people in pricej1.ged and thayer.ged.
Once you know this, you can load coleman2.ged in your genealogy
program (PAF, ROOTS, BK, etc.) and look to see if the coleman2
database goes back further than yours. I suppose you could even
talk to Coleman, but that's rather archaic don't you think?
Notice that I didn't tell the program to look for Mathises or
Thayers. It looked at each individual that came from my gedcom
file, and looked for a match against each individual that came from
the other gedcom files. It does soundex compares on the names, so
exact spelling is not required. It does approximate matching on dates.
It understands abbreviations and can match an abbreviated name to a
fully spelled name. If you don't know what soundex is, look at the
end under "What's a soundex".
COPYRIGHT
Gensrch is copyrighted software. However, you are encouraged to
copy and share it. I place no restrictions on its use by
non-profit people and organizations.
However, if it is used for commercial purposes, I want a piece of
the action. It would be nice to break even.
GENSERV
Genserv (not gensrch) is a system on the internet that was started
by Cliff Manis. It is his own collection of gedcom files along with
utilities to access them. People like you and me can access
genserv's database in much the same way as you access your
local database.
What's the price? A copy of your gedcom file. That's all. No
money. You have just added to the value of genserv as a research
tool by adding your gedcom file.
The genserv system was the reason my gensrch software was developed,
and portions of it have been ported to his genserv machine. Almost
everything you can do with my gensrch software, you can do via email
with genserv, and with a much larger collection of gedcom files.
Genserv, like gensrch, is a free service, and I would like to
encourage anyone with internet access to join the genserv crowd.
At the time this document was written, genserv was being moved to
Genserv@GenTech.Org. By the time you read this, it should be
up and running again. Send mail to Genserv@GenTech.Org to
request material about the server.
DISADVANTAGES
With a large local collection of gedcom files, no matter how you
work it, it's a lot of data to wade through, and a slow process.
Fortunately you don't have to be there. Go to lunch.
The larger your collection, the more disk space you need for it.
PROGRAM DESCRIPTIONS
All of these programs will give a fairly large help screen if you
just invoke them with no parameters. All option flags will be
displayed.
1. ged2srch.exe
Scans a gedcom file, and generates one line of information about
each person in it, like this:
CORLISS, Ann 237 ROBIE, John C642 8 Nov 1657 16 Jun 1691 johns1
It contains several fields. The first is the person's name,
CORLISS, Ann.
Next is a number that just indicates when he/she was encountered
in the gedcom file. It is sometimes the same as the rin number
used by your genealogy program.
The third field is the spouse's name.
Next is the soundex code for the person.
Next is birth date, and death date.
Finally, the database name. There are two ways you, as the
administrator of this database, can decide the database name. The
easiest is to let the gedcom file name decide it with the -g
option.
ged2srch -g *.ged > tmp
This command will scan all the gedcom files you have in this
directory, and generate one liners for each person in each file
and the database name will be the gedcom file name minus the
".ged".
If you don't use the -g option, you must specify a database
name.
ged2srch johns1 c_demo.ged > tmp
will generate data with the database name johns1.
2. brkmail.exe
Breaks the possibly large file generated by ged2srch into a
bunch of smaller files called a.ndx, b.ndx, ... z.ndx. Each
file contains the surnames that start with the same letter as
the file name. It's called brkmail because I used
to get this information from genserv by email and had to BReaK
the MAIL messages up into these files.
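The bucketing step described above can be sketched in Python. This is only an illustration, not the brkmail source: the function name split_by_initial is made up, and it assumes every data line starts with the surname, as in the ged2srch sample line.

```python
import os
from collections import defaultdict

def split_by_initial(lines, out_dir="."):
    """Append each ndx-style line to <letter>.ndx, keyed on the surname's first letter."""
    buckets = defaultdict(list)
    for line in lines:
        text = line.strip()
        if not text or not text[0].isalpha():
            continue  # skip blank lines and anything that does not start with a letter
        buckets[text[0].lower()].append(text)
    for letter, rows in buckets.items():
        path = os.path.join(out_dir, letter + ".ndx")
        with open(path, "a") as out:  # append, so repeated runs add to existing files
            out.write("\n".join(rows) + "\n")
    return sorted(buckets)  # the letters that received data
```

Appending rather than overwriting is a guess at how new gedcom data would be merged into existing ndx files.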
3. srtrpt.exe
Sorts the a.ndx ... z.ndx files. Puts them into soundex order,
and deletes duplicate lines. This makes gensrch run faster, but
is not absolutely necessary. None of the sorts done by these
programs have a memory limitation. As long as there is disk
space for the necessary temporary files, there should be no
problems with large files. Of course, the bigger they are, the
longer it takes to sort.
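A minimal Python sketch of "soundex order, duplicates removed" follows. It is an assumption-laden illustration: the names soundex_key and sort_and_dedupe are hypothetical, the soundex field is located by looking for the first token shaped like a letter plus three digits, and everything is done in memory, whereas srtrpt itself works through temporary files on disk.

```python
import re

# A soundex field looks like one capital letter followed by three digits, e.g. C642.
SOUNDEX_FIELD = re.compile(r"\b[A-Z][0-9]{3}\b")

def soundex_key(line):
    """Return the first token that looks like a soundex code, or "" if none."""
    found = SOUNDEX_FIELD.search(line)
    return found.group(0) if found else ""

def sort_and_dedupe(lines):
    """Drop exact duplicate lines, then order by soundex code, then by line text."""
    unique = set(line.rstrip("\n") for line in lines if line.strip())
    return sorted(unique, key=lambda line: (soundex_key(line), line))
```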
4. gensrch.exe
The final report generator. Searches for matches. Several
options of interest. You can specify how close dates must
match, plus or minus days, months, years.
You can specify how close the names must match. That one takes
some explaining. All names, both first and last, are tested with
soundex compares, not string compares. Soundex is a neat thing
because it allows slight changes in the spelling (Corliss and
Corlisse) to still match. Sometimes, though, it can be too
lenient. For example CROWELL and CURLESS have the same soundex
code. The -F x option specifies by how many letters the spelling
may differ. I like to use -F 3.
The -M option is nice if you are getting lots of matches. It
only shows matches with More than me. In other words, if it
finds a match that has dates or spouse names, etc. that your
data does not have, it will display this match. It will not
display a match if it appears that you have all the data the
other has.
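The "More than me" test can be pictured with a small Python sketch. It is a guess at the idea, not the gensrch code: records are plain dictionaries, and the field names spouse, birth and death are made up for the illustration.

```python
def has_more_than_me(mine, other, fields=("spouse", "birth", "death")):
    """True if `other` fills in at least one field that `mine` leaves empty."""
    return any(other.get(f) and not mine.get(f) for f in fields)
```

With my AYER, Joseph record holding only a spouse name and the other database also holding birth and death years, the function returns True, so the match would be shown. With the two records swapped it returns False.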
The -g option allows you to search a gedcom file that you
haven't added to your ndx files yet, and check it against them.
For example, your neighbor brings over his gedcom file, and you
want to do a quick check to impress him before going through all
the steps to add him permanently to your database.
gensrch -g c_demo.ged *.ndx
For optimization reasons this option is more picky. It will
only search an ndx file if its name starts with the same letter
as the surnames it is looking for. That is why, for instance,
c_demo.ndx is named as it is, starting with a c. That's what
brkmail does anyway, so it should be no problem.
One common mistake with the -g option is to use a gedcom file
that has already had its data merged into the ndx files. This
will result in a zillion matches between the gedcom data and the
duplicate data already in the ndx files.
The -g option causes gensrch to create a database name for the
new gedcom file in upper case. fred.ged will result in a
database name of "FRED", not "fred". Normally the database
names in the ndx files are lower case. This is so you won't
have to be careful about what the name of your new gedcom file is.
Gensrch will generate one ndi file for each ndx file it
searches. This is to make searches run faster. It will remake
the ndi file if it is not there, or if it detects that the ndx file
has been updated, by checking the dates of the two files. After
running the demo, "gensrch johns1 c_demo.ndx", you will find that a
c_demo.ndi file now exists.
Gensrch is being proposed for a non-profit CD-ROM project
(Acadian), and one option was added for that environment. The
-I (upper case) option makes it ignore the dates for the ndi
file. This is just paranoia on my part. I was concerned that
the dates might not get installed properly on the CD, and the
program would choke each time because it could not do anything
about it.
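The rule for when an ndi file gets remade can be sketched in Python as below. This is an illustration under assumptions, not the actual gensrch logic: the function name ndi_needs_rebuild and the ignore_dates parameter (standing in for the -I option) are made up, and "checking the dates" is taken to mean comparing file modification times.

```python
import os

def ndi_needs_rebuild(ndx_path, ndi_path, ignore_dates=False):
    """Decide whether the .ndi file for an .ndx file should be regenerated."""
    if not os.path.exists(ndi_path):
        return True    # no ndi file yet
    if ignore_dates:
        return False   # the -I idea: trust whatever ndi file is there
    # Rebuild when the ndx file is newer than the ndi file built from it.
    return os.path.getmtime(ndx_path) > os.path.getmtime(ndi_path)
```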
5. combsrch.exe
A pretty printer for gensrch. Takes a gensrch report like this:
Possible match for
CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
----------------------------------------
CORLISS, Hildah 240 C642 18 Nov 1661 johns1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
----------------------------------------
CORLISS, Hulda 508 KINGSBURY, S C642 18 Nov 1661 1720 pricej1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
----------------------------------------
CORLISSE, Hulda 239 KINGSBURY, S C642 18 Nov 1661 26 Sep 1698 johns1
These are 3 different matches to the same Hildah Corliss. Combsrch
combines the matches so that C_DEMO's person CORLISS, Hildah is
mentioned only once, like the following:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match[s] for
CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
----------------------------------------
CORLISS, Hildah 240 C642 18 Nov 1661 johns1
CORLISS, Hulda 508 KINGSBURY, S C642 18 Nov 1661 1720 pricej1
CORLISSE, Hulda 239 KINGSBURY, S C642 18 Nov 1661 26 Sep 1698 johns1
It can be used to process the output of gensrch in a file like this:
gensrch johns1 c_demo.ndx > results
combsrch results > results2
Or it can be used as a filter, eliminating the two step process like this:
gensrch johns1 c_demo.ndx | combsrch > results
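The combining step can be sketched in Python roughly as follows. It is a sketch only, not the combsrch source: the function names are hypothetical, and it assumes the report layout shown in the samples above (a "Possible match" line, then the subject line, then a dashed line, then the match lines).

```python
import sys

def combine_matches(report_lines):
    """Group match lines under each distinct 'Possible match for' subject line."""
    grouped = {}            # subject line -> match lines, in first-seen order
    subject = None
    expect_subject = False
    for raw in report_lines:
        line = raw.rstrip("\n")
        if line.startswith("Possible match"):
            expect_subject = True       # the next line names the person
        elif expect_subject:
            subject = line
            grouped.setdefault(subject, [])
            expect_subject = False
        elif line.startswith("=-") or line.startswith("--") or not line.strip():
            continue                    # separators and blank lines carry no data
        elif subject is not None and line not in grouped[subject]:
            grouped[subject].append(line)
    return grouped

def print_combined(grouped, out=sys.stdout):
    """Write one block per subject, listing all of its matches together."""
    for subject, matches in grouped.items():
        out.write("=-" * 20 + "\n")
        out.write("Possible match[s] for\n")
        out.write(subject + "\n")
        out.write("-" * 40 + "\n")
        for match in matches:
            out.write(match + "\n")
```

To use it as a filter in the same spirit as the pipe example, call print_combined(combine_matches(sys.stdin)).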
6. soundex.exe
Just a utility to echo the soundex code for a name. Example:
soundex smith
Soundex for smith is S530
7. deletedb.exe
Scans through ndx files, and deletes the specified database.
For instance you could delete all data from johns1, leaving all
other data intact. This lets you delete all of johns1's gedcom
data so you can replace it with a new copy without having to
recreate the whole database.
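In Python terms, deleting one database from ndx data amounts to a filter like the sketch below. The function name delete_database is hypothetical, and the sketch assumes the database name is always the last field on the line, as it is in the sample lines in this document.

```python
def delete_database(lines, db_name):
    """Keep every ndx-style line whose last field is not db_name."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[-1] == db_name:
            continue  # this person came from the database being removed
        kept.append(line)
    return kept
```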
8. cleanrpt.exe
Scans files with ndx type data, and prints only the valid ndx
style report lines. This has the effect of stripping out any mail
headers, etc. If you are creating all your own data locally,
and not getting it from genserv, you won't need this.
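One way to picture this filtering is the Python sketch below. How cleanrpt really decides that a line is valid is not described here, so the test used is purely an assumption: a line is kept if it starts with "SURNAME," (with no colon before the comma, which rejects mail headers such as "Date: Mon, ...") and contains a soundex-shaped field followed by more text.

```python
import re

# Heuristic shape of an ndx line: "SURNAME, Given ... C642 ... database"
NDX_LINE = re.compile(r"^[^,:]+,.*\s[A-Z][0-9]{3}\s.*\S$")

def clean_report(lines):
    """Keep only the lines that look like ndx data; drop mail headers and the rest."""
    return [line for line in lines if NDX_LINE.match(line.strip())]
```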
9. surnames.exe
Since I belong to a genealogical society which wants a surname
list, I cranked this out to generate a surname list from the
gedcom files. It generates a list like the following which I
format into a multi column report with my word processor.
COOPER 11 johns1
CORLISS 2 corliss
CORLISS 1 johns1
DALTON 5 johns1
DAVIDSON 6 johns1
DAVIS 15 corliss
DAVIS 999 johns1
DAY 5 johns1
Note that even if johns1 has a hundred Corliss's, it will only
show up in this list once. The number is the number of times
that surname was encountered, up to a max of 999. I wanted to
get a multi column report with my word processor, so I had to
put a limit somewhere. Anything over 999 is just lots.
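The counting can be sketched in Python as follows. This is an illustration only: the function name surname_counts is made up, and it reads ndx-style lines (surname before the first comma, database name as the last field), whereas the real program works from the gedcom files.

```python
from collections import Counter

def surname_counts(lines, cap=999):
    """Count people per (surname, database), capping each count at `cap`."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "," not in line or not fields:
            continue  # not a person line
        surname = line.split(",", 1)[0].strip().upper()
        counts[(surname, fields[-1])] += 1
    # One row per surname and database, sorted by surname, then database name.
    return [(s, min(n, cap), db) for (s, db), n in sorted(counts.items())]
```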
10. c_demo.ged
A demo gedcom file. See the demo section.
11. c_demo.ndx
A demo ndx file. See the demo section.
INSTALLATION
Not much to it. You can put the programs in your current directory
and just run them there, or do the following.
Put the programs where you put your other utilities. Under DOS, the
command "path" will print out something like this:
PATH=C:\BIN;C:\DOS;C:\WINWORD;C:\EXCEL;C:\WINDOWS
Each part of that statement separated by semicolons is a directory
that is searched for programs each time you type a command on the
DOS command line. In the above case, if you tried to run the DOS
editor by typing the command "edit gensrch.doc", DOS would look for
edit in the c:\bin directory, then in c:\dos directory where it
would finally find it.
Any directory included in your path will do fine although in the
above case dos, winword, excel, and windows should be avoided just
to keep everything clean. The path definition is normally defined
in your autoexec.bat, and you can add directories if you wish.
The gensrch programs look for the environment variable "TMP" or
"TEMP" to find a place to put temporary files. For example, either of
these lines in your autoexec.bat sets the variable:
Set TMP=C:\tmp
or
Set TEMP=C:\tmp
Don't forget to create the directory.
If you don't have the variable defined, the temporary files will
just end up in your current directory. Normally they are deleted
when the program exits, but if you control-c out of a program,
they will be left behind.
You can see if it is defined with the DOS command "set". It will dump
all the environment variables to the screen. You can browse
through them looking for this variable.
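The lookup described in this section can be sketched in Python as below. It is not the gensrch code: the function name temp_dir is hypothetical, and checking "TMP" before "TEMP" is an assumption, since the order the programs use is not stated.

```python
import os

def temp_dir():
    """Return the directory named by TMP or TEMP, else the current directory."""
    for name in ("TMP", "TEMP"):   # assumed order of preference
        value = os.environ.get(name)
        if value:
            return value
    return os.getcwd()             # fallback: temporary files land here
```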
HOW TO USE WITH A LOCAL COLLECTION OF GEDCOM FILES
1. Place a copy of all your gedcom files in one directory.
2. ged2srch -g -v *.ged > tmp
Scans the gedcom files and writes the style of reports required by
gensrch to the file tmp.
3. brkmail tmp
Takes the tmp file, and breaks it up into a.ndx, b.ndx, ...
z.ndx using the first letter of the surname. You might want to
run brkmail in a different directory from the one where you keep
your gedcom files, to keep things from getting cluttered.
4. When you have all your index files built:
gensrch your_database_name *.ndx > matches
In my case it's:
gensrch johns1 *.ndx > matches
The -M "More than me" option will create a much smaller matches
file. The -p "Progress" option will send the reports to the
screen as well as to the matches file.
5. Take a coffee break :-)
6. Browse through the matches file. Hopefully, it will have found
data from other people who are researching your line, and who often
have dates, etc. that you don't.
It looks something like this. Note, I am johns1. Looks like pricej1
has some data I am missing. Think I'll send him some mail.
Search for matches to database johns1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
AYER, Joseph 368 CORLISS, Sarah A600 johns1
----------------------------------------
AYER, Joseph 516 CORLISS, Sarah A600 1660 1710 pricej1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Possible match for
BROWN, Abigail 329 HARTSHORN, John B650 johns1
----------------------------------------
BROWN, Abigail 575 HARTSHORN, John B650 1694 pricej1
Delete obvious non matches.
USING WITH DATA ON A CD-ROM
Some of the features in this release were developed to work with
Yvon L. Cyr's Acadian/French Canadian CD-ROM project. This project
is basically a collection of gedcom files contributed by anyone who
has Acadian/French Canadian ancestors. The gedcom files will be on the
CD-ROM. The CD-ROM will probably also hold gensrch results files
showing matches between the people who submitted to it.
The a.ndx - z.ndx and associated ndi files will be on the CD-ROM as
generated by ged2srch, brkmail, and srtrpt.
Those who submitted to the CD-ROM will be able to view their match
results with any ascii editor since the matches will be on the
CD-ROM.
But what about those of us who didn't get our gedcom files on the
CD-ROM? Is it useful to us? Yes! That's where the gensrch -g
option comes in. With the gensrch -g option, you can scan for
matches between the gedcom file you made at home and
all the gedcom files on the CD-ROM.
There are a few problems, and their solutions, that you must be aware
of before working with CD-ROM based data. One is that a
CD-ROM is slow. Sorry, can't do much about that. Another is that a
CD-ROM contains huge amounts of data. Lots of data takes lots of
time to scan. Be patient. Go to lunch. Go to bed. Check it in
the morning.
The following rules for working with CD-ROM based data would also
hold true with any write protected data such as that on a write
protected floppy.
The set of gensrch utilities creates temporary files while doing
its work. If the environment variable "TEMP" or "TMP"
is defined, it tells the programs where to put these temporary files.
If neither is defined, the temporary files end up in the current
directory. If that current directory happens to be on the CD-ROM,
things just won't work, so see the section on INSTALLATION.
There are ways around this problem without messing with "TEMP" or
"TMP" if you wish. Let's say for example that your CD-ROM drive is
g: and your regular hard drive is c:. Let's also say that you are in
a directory on your hard drive c: called "george", or anything else
you want to call it. From c:\george, you issue the command:
gensrch -g myged.ged g:*.ndx > results.txt
This searches all the ndx files on the CD-ROM in g: for matches to the
gedcom file on c:, and puts the results in results.txt on c:. Note
that in this case, your current directory is c:\george, which is not
write protected, so temporary files can be created without
problems.
With "TEMP" or "TMP" defined properly you could work from the CD-ROM
directly. For example:
g:
gensrch -g c:myged.ged *.ndx > c:results.txt
In this case the current directory is on the CD-ROM drive, but
"TEMP" tells the program to put the temporary files in a place
typically like c:\temp. No problem. Note that you had to specify a
writable destination for your results.
If you get some sort of error, check to see if you are trying to
create files on the CD-ROM, which of course you cannot do.
DEMOS
There are two files included with the package that are only there
for demo purposes: c_demo.ndx and c_demo.ged.
c_demo.ndx is the type of data you would get after running ged2srch
against your gedcom files, and brkmail against the output of
ged2srch.
It is an ascii file, so you can look at it with any ascii editor.
Of course I cherry picked data that would contain matches, but
that's what demos are about.
To try it out, type the command:
gensrch johns1 c_demo.ndx
It should dump a bunch of matches to the screen. The same command
followed by the redirect to a file syntax "> results.txt", like this:
gensrch johns1 c_demo.ndx > results.txt
will put the match data into the file results.txt, which you can
print or look at with any ascii editor.
Once you have a large collection of gedcom files merged into ndx
files, you might run into the situation where someone brings you
their gedcom file, and you want to run a quick check for matches
without going through all the ged2srch, brkmail steps. The -g
option does this.
gensrch -p -g c_demo.ged c_demo.ndx > results.txt
First it generates ndx style data from your gedcom file, then it
checks this new data against your old ndx files. The -p option sends
a second copy of the match information to your screen so you can
tell something is happening.
Try this one:
gensrch -p -g c_demo.ged c_demo.ndx | combsrch > results2.txt
This did the same as the previous gensrch, but ran it through a
"Pretty Printer". Look at the difference between results.txt, and
results2.txt.
Actually, this -g option was developed to allow searching ndx files
that were placed on a CD. You can't add your gedcom data to the
CD-ROM's data, so you must use the -g option.
WHAT TO EXPECT IN REAL LIFE
Not much at first! Remember, there are a lot of people out there
who are NOT your ancestors. The odds against your neighbor's gedcom
file containing the same ancestors as yours are very high. The
trick is to collect a lot of gedcom files, and reduce the odds.
Unfortunately, a lot of gedcom files, and the resultant data
generated from them, eat disk space. Also, the bigger the
collection, the more time it takes to manage it, so be patient.
Matches are out there, and you might hit real pay dirt. All the
demo files contain real matches that I found on the genserv system,
which is just a big collection, and other than big, is no different
than the one you are now thinking of gathering.
WHAT'S A GEDCOM FILE
There are a lot of programs now that have the sole purpose of making
it easier to maintain genealogy information on your ancestors. One
problem with them is that none of them store their data in the same
format. How do you get data from your cousin back east who uses Roots
when you use PAF? The answer is a gedcom file. All of the better
programs will read and write a gedcom file.
It's merely an ascii file you can look at with any ascii editor, but
it is laid out according to a strict set of rules that most of these
programs stick to.
You can save all your ancestor information from Roots, or Brother's
Keeper to a gedcom file, and restore it all into PAF, etc.
It is not designed to be used by a database management program to
maintain your ancestors' information. It would be extremely slow
for that purpose.
It's designed to be a way to exchange information.
WHAT'S A SOUNDEX
A soundex code is a way of representing a name that isn't too critical
about how the name is spelled. It is an attempt to come up with
a code that represents how a name sounds. If two names sound
close, in theory they should have the same soundex code.
For example the surname Smith has a soundex code of S530. The
surname Smyth has the same soundex code.
Many times in genealogical work, you will find surnames spelled
slightly differently between generations, and when the spelling
skills of the ancestor were poor, even the ancestor would spell
his name several different ways.
Soundex helps detect these slight variations, but of course when
working with those crazy humans, even soundex isn't enough. A good
example is my Moberly ancestors who switch back and forth between
MOBERLY (M164) and MOBLEY (M140). Gensrch will miss these. Sigh.
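For the curious, here is a short Python sketch of the commonly published American Soundex rules. It is not taken from the gensrch source, and whether gensrch follows exactly these rules is not stated in this document, but the sketch does reproduce the codes quoted above: S530 for Smith and Smyth, M164 for Moberly, and M140 for Mobley.

```python
# Letter groups that share a soundex digit; vowels, H, W and Y have no digit.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    """Return the 4-character soundex code for a name, or "" if it has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                     # the first letter is kept as is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit                 # new sound, so record its digit
        if ch not in "HW":
            prev = digit                  # H and W do not separate equal sounds
    return (code + "000")[:4]             # pad with zeros, cut to 4 characters
```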
THE AUTHOR
John Smith
28032 Singleleaf
Mission Viejo
California USA 92692
jsmithii@netcom.com
johns@FileNet.com